Forecasting

Now that we have the cleaned dataset with the appropriate features, we can finally perform forecasting. Forecasting here is just regression, with the constraint that the training data must be sequential up to a cutoff date and the test data must be the sequence that follows it.

For example, the training data can run from 2018-01-01 to 2022-12-01, the validation data from 2023-01-01 to 2023-12-01, and the test data from 2024-01-01 to 2024-12-01.
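
A minimal sketch of such a chronological split, assuming the cleaned DataFrame df has a monthly DatetimeIndex:

    # df is the cleaned, feature-engineered DataFrame with a monthly DatetimeIndex.
    train = df[df.index <= "2022-12-01"]
    valid = df[(df.index >= "2023-01-01") & (df.index <= "2023-12-01")]
    test = df[df.index >= "2024-01-01"]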

  1. First, a new column is made for the forecast values, and its entries up to the end of the training period are temporarily set to the actual values.

    df["forecast"] = 0
    df.loc[(df.index < "2023-01-01"), ["forecast"]] = df.loc[(df.index < "2023-01-01"), ["quantity_delivered"]].values

    Note that throughout this example, the training lasts until 2022-12-01 and validation is from 2023-01-01 to 2023-12-01.

  2. Next, the training data is extracted, the dependent and independent variables are separated, and the model is fit on the training data. The model can be any regressor with fit() and predict() methods, but a StackingRegressor() with bagging and boosting regressors (RandomForestRegressor, BaggingRegressor, CatBoostRegressor, etc.) as the base estimators and HuberRegressor as the final estimator has been found to give the best results; a sketch of such a model is shown after this list.

    The exact combination of estimators in the StackingRegressor that gives optimal results can vary across countries and brands.

    train = df[df.index < "2023-01-01"]
    x_train, y_train = train.drop(["quantity_delivered", "forecast"], axis=1), train["quantity_delivered"]
    model.fit(x_train, y_train)
  3. Now that the model is fit, it is used to perform predictions. The predictions are made in an autoregressive manner: each period's forecast is fed back into the lag features before the next period is predicted. In the following snippet, autoregression() is the function that updates the lag features and predicts for each time period, and decompose() is the function that recomputes the trend and seasonality features as explained in the Feature Engineering section; a sketch of one possible autoregressive step is shown after this list.

    for time in times:
        df = autoregression(df, model, time)
        if 'trend_1' in df.columns:
            df = decompose(df, time)

    Note that decompose() is called only if the seasonal features exist in the first place; they may not have been engineered for some countries/brands.

  4. Finally, the forecast values for the training period are replaced with the model's predictions for that period. They were set to the actual values earlier only to make the autoregression more convenient.

    y_pred = model.predict(x_train)
    y_pred[y_pred < 0] = 0
    df.loc[df.index < "2023-01-01", ["forecast"]] = y_pred
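
As a reference for step 2, here is a minimal sketch of how such a stacking model could be assembled, assuming scikit-learn and CatBoost are installed. The particular estimators and hyperparameters below are illustrative placeholders, not the tuned configuration.

    from sklearn.ensemble import StackingRegressor, RandomForestRegressor, BaggingRegressor
    from sklearn.linear_model import HuberRegressor
    from catboost import CatBoostRegressor

    # Bagging and boosting regressors as base estimators, HuberRegressor as the
    # final estimator that combines their predictions.
    model = StackingRegressor(
        estimators=[
            ("rf", RandomForestRegressor(n_estimators=200, random_state=42)),
            ("bag", BaggingRegressor(n_estimators=100, random_state=42)),
            ("cat", CatBoostRegressor(verbose=0, random_state=42)),
        ],
        final_estimator=HuberRegressor(),
    )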
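
For step 3, the sketch below shows what a single autoregression() step might look like. It is an assumption-laden illustration: the lag column names (lag_1, lag_2, ...) and the n_lags argument are placeholders, and the real implementation may differ.

    def autoregression(df, model, time, n_lags=12):
        # Hypothetical sketch. `time` is an index label (e.g. a Timestamp from df.index).
        # Lag features are rebuilt from the "forecast" column, which holds actuals for
        # the training period and previously predicted values afterwards, and then the
        # single period `time` is predicted and written back.
        for lag in range(1, n_lags + 1):
            col = f"lag_{lag}"
            if col in df.columns:
                df[col] = df["forecast"].shift(lag)
        x = df.loc[[time]].drop(["quantity_delivered", "forecast"], axis=1)
        df.loc[time, "forecast"] = max(model.predict(x)[0], 0)
        return df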

The entire process from preprocessing to forecasting is performed by my library tsf.